Labeling & Peer Grading: Your homework will be peer graded. To stay anonymous, avoid using your name and label your file with the last four digits of your student ID (e.g., HW#_Solutions_3938).
Submission: Submit both your IPython notebook (.ipynb) and an HTML file of the notebook to Canvas under Assignments → HW # → Submit Assignment. After submitting, download and check the files to make sure that you've uploaded the correct versions. Both files are required for your HW to be graded.
AI Use Policy: Solve each problem independently. You may use AI tools such as ChatGPT or Google Gemini for brainstorming and learning only; copying AI-generated content is prohibited. Violations will lead to penalties, up to failing the course.
Problem Structure: Break down each problem ( already done in most problems) into three interconnected parts and implement each in separate code cells. Ensure that each part logically builds on the previous one. Include comments in your code to explain its purpose, followed by a Markdown cell analyzing what was achieved. After completing all parts, add a final Markdown cell reflecting on your overall approach, discussing any challenges faced, and explaining how you utilized AI tools in your process.
Upload the sn_ids.csv and do the following. Make sure to explain all the details for each part.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import json
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 20)
# Load the config file
with open('config.json', 'r') as f:
    config = json.load(f)
data_loc = config["data_loc"]
file_name = "sn_ids.csv"
sn_df = pd.read_csv(data_loc + file_name)
rows, columns = sn_df.shape
print(f"The dataset contains {rows:,} rows and {columns} columns")
sn_df.head(5)
The dataset contains 289,003 rows and 2 columns
| | id_1 | id_2 |
|---|---|---|
| 0 | 0 | 23977 |
| 1 | 1 | 34526 |
| 2 | 1 | 2370 |
| 3 | 1 | 14683 |
| 4 | 1 | 29982 |
import networkx as nx
import matplotlib.pyplot as plt
from sknetwork.data import Bunch
from sknetwork.ranking import PageRank
from scipy.sparse import csr_matrix
# Convert the data to a directed graph
G = nx.from_pandas_edgelist(sn_df, 'id_1', 'id_2', create_using=nx.DiGraph)
# Convert the NetworkX graph to a sparse CSR matrix
adjacency = csr_matrix(nx.to_scipy_sparse_array(G, dtype=None, weight='weight', format='csr'))
names = np.array(list(G.nodes()))
graph = Bunch()
graph.adjacency = adjacency
graph.names = names
# Apply the PageRank algorithm
pagerank = PageRank()
pagerank.fit(adjacency)
scores = pagerank.scores_
scores = [round(score, 3) for score in scores]
# Convert the PageRank scores to a DataFrame
pagerank_df = pd.DataFrame({'ids': names, 'PageRank': scores}).sort_values(by='PageRank', ascending=False).reset_index(drop=True)
pagerank_df.head()
| | ids | PageRank |
|---|---|---|
| 0 | 31890 | 0.015 |
| 1 | 36652 | 0.013 |
| 2 | 18163 | 0.010 |
| 3 | 36628 | 0.010 |
| 4 | 34114 | 0.006 |
# Select the top 5 IDs
top_5_ids = pagerank_df['ids'].head(5).tolist()
# Filter data based on the top 5 IDs
filtered_df = sn_df[(sn_df['id_1'].isin(top_5_ids)) | (sn_df['id_2'].isin(top_5_ids))]
# filtered_df = filtered_df[0:2000]
filtered_df
| | id_1 | id_2 |
|---|---|---|
| 22 | 6 | 31890 |
| 38 | 34957 | 31890 |
| 45 | 8 | 36652 |
| 69 | 10 | 31890 |
| 123 | 11 | 31890 |
| ... | ... | ... |
| 288973 | 12628 | 34114 |
| 288976 | 34114 | 37535 |
| 288977 | 34114 | 37431 |
| 288978 | 34114 | 37460 |
| 288979 | 34114 | 2730 |
15631 rows × 2 columns
# Create the directed graph from the filtered data
G = nx.from_pandas_edgelist(filtered_df, 'id_1', 'id_2', create_using=nx.DiGraph())
# Set up plot with figsize of 50x50
plt.figure(figsize=(50, 50))
# Position the nodes using a spring layout
pos = nx.spring_layout(G, k=0.1)
# Draw nodes with different colors for "id_1" and "id_2"
id_1_nodes = filtered_df['id_1'].unique()
id_2_nodes = filtered_df['id_2'].unique()
# Draw nodes from "id_1" in sky blue and nodes from "id_2" in red
nx.draw_networkx_nodes(G, pos, nodelist=id_1_nodes, node_color='skyblue', node_size=800)
nx.draw_networkx_nodes(G, pos, nodelist=id_2_nodes, node_color='red', node_size=800)
# Draw edges
nx.draw_networkx_edges(G, pos, edge_color='black', width=1)
# Set the plot title
plt.title("Network Graph Based on PageRank Score", size=80)
plt.show()
Based on the network graph above, it is hard to derive conclusions and insights due to the high volume of the dataset.
The initial dataset has 289,003 connections. After selecting the top 5 IDs by PageRank score and using them to filter the original sn_ids.csv on 'id_1' and 'id_2', the dataset shrinks to 15,631 connections, a 94.59% reduction in data volume.
To make the graph more interpretable, a further reduction or sampling of the filtered data is necessary. By visualizing a smaller subset, we can better observe the hierarchical or clustered structures around these influential nodes. This refined approach facilitates a clearer understanding of relationships and potential key influencers in the network, offering valuable insights for applications like social network analysis or recommendation systems.
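The 94.59% figure follows directly from the two edge counts reported above; a quick arithmetic check:

```python
# Verify the quoted data-volume reduction from the two row counts above
original_edges = 289_003   # rows in sn_ids.csv
filtered_edges = 15_631    # rows after filtering on the top-5 PageRank IDs

reduction_pct = (1 - filtered_edges / original_edges) * 100
print(f"Data volume reduction: {reduction_pct:.2f}%")  # → 94.59%
```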
# Select the top 5 IDs
top_5_ids = pagerank_df['ids'].head(5).tolist()
# Filter data based on the top 5 IDs
filtered_df = sn_df[(sn_df['id_1'].isin(top_5_ids)) | (sn_df['id_2'].isin(top_5_ids))]
filtered_df = filtered_df[0:2000]
# Create the directed graph from the filtered data
G = nx.from_pandas_edgelist(filtered_df, 'id_1', 'id_2', create_using=nx.DiGraph())
# Set up plot with figsize of 50x50
plt.figure(figsize=(50, 50))
# Position the nodes using a spring layout
pos = nx.spring_layout(G, k=0.1)
# Draw nodes with different colors for "id_1" and "id_2"
id_1_nodes = filtered_df['id_1'].unique()
id_2_nodes = filtered_df['id_2'].unique()
# Draw nodes from "id_1" in sky blue and nodes from "id_2" in red
nx.draw_networkx_nodes(G, pos, nodelist=id_1_nodes, node_color='skyblue', node_size=800)
nx.draw_networkx_nodes(G, pos, nodelist=id_2_nodes, node_color='red', node_size=800)
# Draw edges
nx.draw_networkx_edges(G, pos, edge_color='black', width=1)
# Set the plot title
plt.title("Network Graph Based on PageRank Score with a Subset of the Filtered Data", size=40)
plt.show()
This network graph, based on PageRank scores with a subset of filtered data, highlights the influence of key nodes (in red) within the network. These red nodes, positioned centrally, connect to multiple clusters, indicating their role as influential hubs. The structure shows a distinct hub-and-spoke pattern, with central nodes linking to numerous smaller nodes. This setup suggests a hierarchical relationship, where these hubs act as primary connectors, facilitating interaction across different parts of the network. Despite the data reduction, the high connectivity still demonstrates the importance of these influential nodes in maintaining the network's overall structure.
# Convert the data to a directed graph
G = nx.from_pandas_edgelist(sn_df, 'id_1', 'id_2', create_using=nx.DiGraph())
# Apply the HITS algorithm (G is already a directed graph, so no conversion is needed)
hubs, authorities = nx.hits(G, max_iter=50, normalized=True)
ids = list(authorities.keys())
hub_scores = [round(value, 3) for value in hubs.values()]
authorities_scores = [round(value, 3) for value in authorities.values()]
# Convert the authority scores into a DataFrame
authority_df = pd.DataFrame({
    'ids': ids,
    'Hub Score': hub_scores,
    'Authority': authorities_scores
}).sort_values(by='Authority', ascending=False).reset_index(drop=True)
authority_df.head()
| | ids | Hub Score | Authority |
|---|---|---|---|
| 0 | 31890 | 0.001 | 0.010 |
| 1 | 35773 | 0.001 | 0.003 |
| 2 | 36652 | 0.000 | 0.002 |
| 3 | 19222 | 0.001 | 0.002 |
| 4 | 35008 | 0.000 | 0.002 |
# Select the top 5 IDs based on Authority values
top_5_ids = authority_df['ids'].head(5).tolist()
# Filter the original data based on the top 5 Authority IDs
filtered_df = sn_df[(sn_df['id_1'].isin(top_5_ids)) | (sn_df['id_2'].isin(top_5_ids))]
# filtered_df = filtered_df[0:2000]
filtered_df
| | id_1 | id_2 |
|---|---|---|
| 22 | 6 | 31890 |
| 28 | 7 | 35773 |
| 38 | 34957 | 31890 |
| 45 | 8 | 36652 |
| 69 | 10 | 31890 |
| ... | ... | ... |
| 288795 | 36652 | 33051 |
| 288796 | 36652 | 37649 |
| 288797 | 36652 | 25233 |
| 288798 | 36652 | 37672 |
| 288799 | 36652 | 37562 |
19647 rows × 2 columns
# Create a graph for the filtered data
G_filtered = nx.from_pandas_edgelist(filtered_df, 'id_1', 'id_2', create_using=nx.DiGraph())
# Set up plot with figsize of 50x50
plt.figure(figsize=(50, 50))
# Position the nodes using a spring layout
pos = nx.spring_layout(G_filtered, k=0.1)
# Draw nodes with different colors for "id_1" and "id_2"
id_1_nodes = filtered_df['id_1'].unique()
id_2_nodes = filtered_df['id_2'].unique()
# Draw nodes from "id_1" in sky blue and nodes from "id_2" in red
nx.draw_networkx_nodes(G_filtered, pos, nodelist=id_1_nodes, node_color='skyblue', node_size=800)
nx.draw_networkx_nodes(G_filtered, pos, nodelist=id_2_nodes, node_color='red', node_size=800)
# Draw edges
nx.draw_networkx_edges(G_filtered, pos, edge_color='black', width=0.5)
# Set the plot title
plt.title("Network Graph Based on HITS Authority", size=80)
plt.show()
This HITS-based network graph, like the previous PageRank graph, shows a highly dense structure with numerous connections, making it challenging to derive meaningful insights at this level of complexity. The red nodes likely represent high-authority nodes, surrounded by many other nodes connecting to them.
The structure reveals a layered organization with core nodes in the center and outer nodes connected to them, indicating a potential hierarchical or influential relationship. However, the density of connections obscures specific patterns, highlighting the need for further filtering or sampling to create a more interpretable visualization. By focusing on a representative subset of the data, we can better understand the core connection patterns and influence of high-authority nodes within this network.
# Select the top 5 IDs based on Authority values
top_5_ids = authority_df['ids'].head(5).tolist()
# Filter the original data based on the top 5 Authority IDs
filtered_df = sn_df[(sn_df['id_1'].isin(top_5_ids)) | (sn_df['id_2'].isin(top_5_ids))]
filtered_df = filtered_df[0:2000]
# Create a graph for the filtered data
G_filtered = nx.from_pandas_edgelist(filtered_df, 'id_1', 'id_2', create_using=nx.DiGraph())
# Set up plot with figsize of 50x50
plt.figure(figsize=(50, 50))
# Position the nodes using a spring layout
pos = nx.spring_layout(G_filtered, k=0.1)
# Draw nodes with different colors for "id_1" and "id_2"
id_1_nodes = filtered_df['id_1'].unique()
id_2_nodes = filtered_df['id_2'].unique()
# Draw nodes from "id_1" in sky blue and nodes from "id_2" in red
nx.draw_networkx_nodes(G_filtered, pos, nodelist=id_1_nodes, node_color='skyblue', node_size=800)
nx.draw_networkx_nodes(G_filtered, pos, nodelist=id_2_nodes, node_color='red', node_size=800)
# Draw edges
nx.draw_networkx_edges(G_filtered, pos, edge_color='black', width=0.5)
# Set the plot title
plt.title("Network Graph Based on HITS Authority (Subset of the Filtered Data)", size=80)
plt.show()
This network graph of 2,000 connections highlights 5 high-authority nodes (in red) with dense clusters of blue nodes connected to them. The red nodes demonstrate strong influence, attracting multiple connections. Some blue nodes link to more than one high-authority node, acting as bridges across clusters and enhancing interconnectivity. This clearer view emphasizes the role of high-authority nodes in structuring the network and connecting communities.
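The "bridge" claim above can be checked programmatically: a node bridges clusters if it links to two or more of the top-5 authority hubs. A minimal sketch on a toy edge list (the IDs here are illustrative, not taken from the real filtered data):

```python
from collections import Counter

# Toy edge list standing in for filtered_df (hypothetical pairs, not real data)
edges = [(6, 31890), (6, 36652), (7, 35773), (8, 36652),
         (10, 31890), (10, 35773), (11, 31890)]
top_authorities = {31890, 35773, 36652, 19222, 35008}

# Count, for every non-hub node, how many distinct top-authority nodes it touches
links = Counter()
for u, v in edges:
    if v in top_authorities and u not in top_authorities:
        links[u] += 1
    if u in top_authorities and v not in top_authorities:
        links[v] += 1

bridges = sorted(n for n, c in links.items() if c >= 2)
print(bridges)  # → [6, 10]  (nodes linked to two or more high-authority hubs)
```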
Do the following using the Yahoo Finance package. As usual, write the analysis details and explain all that you do for each part.
import pandas as pd
import yfinance as yf
import numpy as np
import requests
top20_tickers = ["AAPL", "AMZN", "MSFT", "GOOG", "GOOGL", "META", "TSLA", "NVDA", "JPM", "JNJ", "V", "PG",
"UNH", "HD", "MA", "BAC", "DIS", "PYPL", "NFLX", "ADBE"]
# Initialize an empty DataFrame to store results
all_holders = pd.DataFrame()
for ticker in top20_tickers:
    stock = yf.Ticker(ticker)
    institutional_holders = stock.institutional_holders
    if institutional_holders is None or institutional_holders.empty:
        continue  # Skip tickers with no institutional-holder data
    institutional_holders['Ticker'] = ticker
    all_holders = pd.concat([all_holders, institutional_holders.head(10)], ignore_index=True)
all_holders = all_holders[['Ticker', 'Holder', 'Value']]
rows, columns = all_holders.shape
print(f"The dataset contains {rows:,} rows and {columns} columns")
all_holders.head(5)
The dataset contains 200 rows and 3 columns
| | Ticker | Holder | Value |
|---|---|---|---|
| 0 | AAPL | Vanguard Group Inc | 252876459508 |
| 1 | AAPL | Blackrock Inc. | 201659137420 |
| 2 | AAPL | Berkshire Hathaway, Inc | 177591247296 |
| 3 | AAPL | State Street Corporation | 112288817516 |
| 4 | AAPL | FMR, LLC | 59561715772 |
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
from matplotlib.cm import ScalarMappable
from IPython.display import SVG
from sknetwork.visualization import svg_graph
from sknetwork.data import Bunch
from sknetwork.ranking import PageRank
from scipy.sparse import csr_matrix
holder_color = "skyblue"
ticker_color = "red"
holder_size = 1000
# Normalize holding values for edge thickness
all_holders['Normalized Value'] = all_holders['Value'] / all_holders['Value'].max()
# Create directed graph
G = nx.from_pandas_edgelist(all_holders, 'Holder', 'Ticker', ['Value'], create_using=nx.DiGraph())
# Set up plot with figsize of 40x30
plt.figure(figsize=(40, 30))
# Draw nodes
holders = [node for node in G.nodes() if node in all_holders['Holder'].values]
tickers = [node for node in G.nodes() if node in all_holders['Ticker'].values]
# Define positions using spring layout for a more organic structure
pos = nx.spring_layout(G, k=0.3, seed=42)
# Scale ticker sizes by degree (number of connections)
ticker_sizes = [G.degree(ticker) * 500 for ticker in tickers]
# Draw holder nodes in sky blue and ticker nodes in red
nx.draw_networkx_nodes(G, pos, nodelist=tickers, node_color=ticker_color, node_size=ticker_sizes)
nx.draw_networkx_nodes(G, pos, nodelist=holders, node_color=holder_color, node_size=holder_size)
# Draw labels for holders slightly outside the nodes
holder_labels = {node: node for node in holders}
holder_label_pos = {node: (pos[node][0], pos[node][1] + 0.05) for node in holders} # Offset for visibility
nx.draw_networkx_labels(G, holder_label_pos, labels=holder_labels, font_size=25, verticalalignment="bottom")
# Draw edges with thickness based on normalized holding values
edge_widths = [all_holders.loc[(all_holders['Holder'] == u) & (all_holders['Ticker'] == v), 'Normalized Value'].values[0] * 5 for u, v in G.edges()]
nx.draw_networkx_edges(G, pos, edge_color='black', width=edge_widths)
# Draw labels for tickers inside the nodes
ticker_labels = {node: node for node in tickers}
nx.draw_networkx_labels(G, pos, labels=ticker_labels, font_size=20, font_color="white")
# Create a legend
plt.scatter([], [], c=holder_color, label='Institutional Holders', s=400)
plt.scatter([], [], c=ticker_color, label='Ticker Symbols', s=400)
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, loc='upper right', fontsize=30)
# Set the plot title
plt.title("Institutional Ownership Network", size=50)
plt.show()
holder_color = "skyblue"
ticker_color = "red"
holder_size = 1000
# Normalize holding values for edge thickness
all_holders['Normalized Value'] = all_holders['Value'] / all_holders['Value'].max()
# Create directed graph
G = nx.from_pandas_edgelist(all_holders, 'Holder', 'Ticker', ['Value'], create_using=nx.DiGraph())
# Set up plot with figsize of 40x30
plt.figure(figsize=(40, 30))
# Draw nodes
holders = [node for node in G.nodes() if node in all_holders['Holder'].values]
tickers = [node for node in G.nodes() if node in all_holders['Ticker'].values]
# Use shell layout to position tickers in the center and holders outside
pos = nx.shell_layout(G, nlist=[tickers, holders])
# Scale ticker sizes by degree (number of connections)
ticker_sizes = [G.degree(ticker) * 500 for ticker in tickers]
# Draw holder nodes in sky blue and ticker nodes in red
nx.draw_networkx_nodes(G, pos, nodelist=tickers, node_color=ticker_color, node_size=ticker_sizes)
nx.draw_networkx_nodes(G, pos, nodelist=holders, node_color=holder_color, node_size=holder_size)
# Draw labels for holders slightly outside the nodes
holder_labels = {node: node for node in holders}
holder_label_pos = {node: (pos[node][0], pos[node][1] + 0.04) for node in holders} # Offset for visibility
nx.draw_networkx_labels(G, holder_label_pos, labels=holder_labels, font_size=25, verticalalignment="bottom")
# Draw edges with thickness based on normalized holding values
edge_widths = [all_holders.loc[(all_holders['Holder'] == u) & (all_holders['Ticker'] == v), 'Normalized Value'].values[0] * 5 for u, v in G.edges()]
nx.draw_networkx_edges(G, pos, edge_color='black', width=edge_widths)
# Draw labels for tickers inside the nodes
ticker_labels = {node: node for node in tickers}
nx.draw_networkx_labels(G, pos, labels=ticker_labels, font_size=20, font_color="white")
# Create a legend
plt.scatter([], [], c=holder_color, label='Institutional Holders', s=400)
plt.scatter([], [], c=ticker_color, label='Ticker Symbols', s=400)
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, loc='upper right', fontsize=30)
plt.title("Institutional Ownership Network: Central Stock Tickers and Outer Institutional Holders", size=45)
plt.show()
This network visualization highlights the relationships between major institutional holders (in blue) and top stock tickers (in red). Centralizing the stock tickers and placing institutional holders around them clearly illustrates how multiple institutions connect to popular stocks. The varying edge thickness, based on normalized holding values, gives a visual cue of the strength of each institutional investment in a particular stock. Node sizes vary little because every ticker has exactly 10 institutional holders. This layout emphasizes the stocks most influential in attracting institutional investment and the diverse range of institutions backing these key assets.
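The equal-size observation is easy to verify: since each ticker keeps its top 10 institutional holders, every ticker node has the same in-degree. A toy check (hypothetical holder names, 3 holders per ticker for brevity):

```python
from collections import Counter

# Toy holder→ticker edges mirroring all_holders (hypothetical names, not real data)
edges = [("Vanguard", "AAPL"), ("Blackrock", "AAPL"), ("State Street", "AAPL"),
         ("Vanguard", "MSFT"), ("Blackrock", "MSFT"), ("FMR", "MSFT")]

# In-degree of each ticker = number of institutional holders pointing at it
ticker_degree = Counter(ticker for _, ticker in edges)
print(ticker_degree)

# Equal degrees imply equal node sizes in the degree-scaled plot
assert len(set(ticker_degree.values())) == 1
```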
Do the following and write the findings from your analysis.
import newspaper
from newspaper import Article
from tqdm import tqdm
import wikipedia as wiki
from GoogleNews import GoogleNews
import pandas as pd
from datetime import datetime, timedelta
import re
import time
# Function to parse relative dates
def parse_relative_date(relative_date_str):
    now = datetime.now()
    if 'hour' in relative_date_str:
        hours = int(re.search(r'\d+', relative_date_str).group())
        return now - timedelta(hours=hours)
    elif 'minute' in relative_date_str:
        minutes = int(re.search(r'\d+', relative_date_str).group())
        return now - timedelta(minutes=minutes)
    elif 'day' in relative_date_str:
        days = int(re.search(r'\d+', relative_date_str).group())
        return now - timedelta(days=days)
    elif 'week' in relative_date_str:
        weeks = int(re.search(r'\d+', relative_date_str).group())
        return now - timedelta(weeks=weeks)
    else:
        return now  # Default to now if the date format is not recognized
# Define list of stock tickers
top20_tickers = ["AAPL", "AMZN", "MSFT", "GOOG", "GOOGL", "META", "TSLA", "NVDA", "JPM", "JNJ",
"V", "PG", "UNH", "HD", "MA", "BAC", "DIS", "PYPL", "NFLX", "ADBE"]
# top20_tickers = ["AAPL", "AMZN"]
# Define date range for recent articles
googlenews = GoogleNews(lang='en')
googlenews.set_encode('utf-8')
# Initialize an empty list to store article data
articles_data = []
for ticker in tqdm(top20_tickers, desc="Fetching articles for stocks"):
    # Search for news articles related to the ticker
    googlenews.search(ticker)
    # Collect articles from the first pages
    news_results = []
    unique_titles = set()
    for page in range(1, 10):  # Retrieve the first N pages
        googlenews.getpage(page)
        page_results = googlenews.result()
        time.sleep(2)
        # Filter duplicates by title
        for news in page_results:
            title = news.get('title', None)
            # Check if the title is unique and we haven't reached 25 articles
            if title and title not in unique_titles:
                unique_titles.add(title)  # Add title to the set of seen titles
                news_results.append(news)  # Add the unique article to results
            # Stop once we have 25 unique articles
            if len(news_results) >= 25:
                break
        if len(news_results) >= 25:
            break
    # print(f"News length: {len(news_results)}")
    # Loop through each news result
    for news in news_results:
        relative_date = news.get('date', None)
        title = news.get('title', None)
        link = news.get('link', None)
        media = news.get('media', None)
        # Convert the relative date to an actual datetime
        actual_date = parse_relative_date(relative_date) if relative_date else None
        # Strip query parameters from the URL
        if "&" in link:
            link = link.split('&')[0]
        # Try to fetch the journalist and content
        journalist = None
        content = None
        try:
            article = Article(link)
            article.download()
            article.parse()
            journalist = article.authors if article.authors else None
            content = article.text if article.text else title
        except Exception:
            content = title  # Fall back to the title if the full article is not accessible
        # Append the article data to the list
        articles_data.append({
            'Ticker': ticker,
            'Relative Date': relative_date,
            'Date': actual_date.strftime('%Y-%m-%d %H:%M:%S') if actual_date else 'N/A',
            'Media': media,
            'Journalist': ', '.join(journalist) if journalist else 'N/A',
            'Article Content': content,
            'Article Link': link
        })
    # Clear the GoogleNews search results
    googlenews.clear()
# Convert the list to a DataFrame
articles_df = pd.DataFrame(articles_data)
rows, columns = articles_df.shape
print(f"The dataset contains {rows:,} rows and {columns} columns")
articles_df.head(5)
The dataset contains 500 rows and 7 columns
| | Ticker | Relative Date | Date | Media | Journalist | Article Content | Article Link |
|---|---|---|---|---|---|---|---|
| 0 | AAPL | 1 hour ago | 2024-11-04 06:43:12 | Simply Wall Street | N/A | Apple ( ) Full Year 2024 Results\n\nKey Financ... | https://simplywall.st/stocks/us/tech/nasdaq-aa... |
| 1 | AAPL | 1 hour ago | 2024-11-04 06:43:13 | Benzinga | David Pinsen | Remembering The Ultimate Example Of DEI\n\nIn ... | https://www.benzinga.com/markets/24/11/4171071... |
| 2 | AAPL | 2 hours ago | 2024-11-04 05:43:13 | FXLeaders | Skerdian Meta | aapl-usd\n\nStock markets including Apple stoc... | https://www.fxleaders.com/news/2024/11/04/look... |
| 3 | AAPL | 2 hours ago | 2024-11-04 05:43:14 | StreetInsider | N/A | BofA Securities Reiterates Buy Rating on Apple... | https://www.streetinsider.com/Analyst%2BCommen... |
| 4 | AAPL | 5 hours ago | 2024-11-04 02:43:15 | Defense World | Defense World Staff | Wealth Dimensions Group Ltd. lifted its stake ... | https://www.defenseworld.net/2024/11/04/wealth... |
# Load the config file
with open('config.json', 'r') as f:
    config = json.load(f)
data_loc = config["data_loc"]
file_name = "stocks_news_articles.csv"
file_destination = data_loc + file_name
articles_df.to_csv(file_destination, index=False)
import matplotlib.pyplot as plt
from textblob import TextBlob
# articles_dff = pd.read_csv(file_destination)
# articles_dff = articles_dff.fillna(" ")
# articles_dff.head()
# Perform sentiment analysis
articles_df['Sentiment Score'] = articles_df['Article Content'].apply(lambda x: TextBlob(x).sentiment.polarity)
# Sort the DataFrame by sentiment score in descending order (most positive first)
articles_df = articles_df.sort_values(by='Sentiment Score', ascending=False).reset_index(drop=True)
# Display the sorted DataFrame
articles_df
| | Ticker | Relative Date | Date | Media | Journalist | Article Content | Article Link | Sentiment Score |
|---|---|---|---|---|---|---|---|---|
| 0 | META | 2 days ago | 2024-11-02 07:45:12 | YouTube | N/A | Magnificent Meta and Microsoft | https://www.youtube.com/watch%3Fv%3DTB_dDBD-DGQ | 1.000000 |
| 1 | MSFT | 2 days ago | 2024-11-02 07:44:14 | Seeking Alpha | N/A | Microsoft: Most Magnificent, Fairly Valued (NA... | https://seekingalpha.com/article/4731992-micro... | 0.733333 |
| 2 | V | 0 hours ago | 2024-11-04 07:47:15 | Business Standard | Business Standard | US election showdown: Latest polls show Harris... | https://www.business-standard.com/world-news/u... | 0.500000 |
| 3 | NVDA | 2 minutes ago | 2024-11-04 07:43:50 | Barron's | N/A | Nvidia, Apple, Sherwin-Williams, DJT, Talen En... | https://www.barrons.com/articles/stock-market-... | 0.500000 |
| 4 | BAC | 1 day ago | 2024-11-03 07:49:47 | MSN | N/A | Buffett's Berkshire Hathaway cuts Apple, BofA ... | https://www.msn.com/en-us/money/topstocks/buff... | 0.500000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | META | 2 days ago | 2024-11-02 07:45:10 | Seeking Alpha | N/A | Meta: AI Train Isn't Slowing Down Anytime Soon | https://seekingalpha.com/article/4732314-meta-... | -0.155556 |
| 496 | AMZN | 6 minutes ago | 2024-11-04 07:37:40 | Barron's | N/A | Talen Stock Tumbles on Amazon Nuclear Power Se... | https://www.barrons.com/articles/talen-stock-p... | -0.155556 |
| 497 | PG | 1 day ago | 2024-11-03 07:47:48 | Jagran English | N/A | NEET PG Counselling 2024 Schedule: The Medical... | https://english.jagran.com/education/neet-pg-c... | -0.163636 |
| 498 | DIS | 6 hours ago | 2024-11-04 01:50:16 | Tech in Asia | N/A | If you're seeing this message, that means Java... | https://www.techinasia.com/news/disney-forms-b... | -0.200000 |
| 499 | PYPL | 3 days ago | 2024-11-01 07:50:46 | Seeking Alpha | N/A | PayPal: Just 3 Million Shy Of Making Another R... | https://seekingalpha.com/article/4731471-paypa... | -0.500000 |
500 rows × 8 columns
# Summary statistics
print("Summary Statistics of Sentiment Scores:")
print("Mean Sentiment Score:", articles_df['Sentiment Score'].mean())
print("Median Sentiment Score:", articles_df['Sentiment Score'].median())
print("Minimum Sentiment Score:", articles_df['Sentiment Score'].min())
print("Maximum Sentiment Score:", articles_df['Sentiment Score'].max())
Summary Statistics of Sentiment Scores:
Mean Sentiment Score: 0.09884682181507255
Median Sentiment Score: 0.07927035330261134
Minimum Sentiment Score: -0.5
Maximum Sentiment Score: 1.0
The mean sentiment score is 0.099 and the median is 0.079 (on TextBlob's -1 to +1 polarity scale), indicating a slightly positive overall sentiment.
The sentiment scores range from a minimum of -0.5 (most negative) to a maximum of 1 (most positive).
# Histogram of sentiment scores
plt.figure(figsize=(10, 6))
plt.hist(articles_df['Sentiment Score'], bins=20, color='skyblue', edgecolor='black')
plt.title("Distribution of Sentiment Scores")
plt.xlabel("Sentiment Score")
plt.ylabel("Number of Articles")
plt.show()
The histogram of sentiment scores shows a clustering around slightly positive scores, with fewer articles exhibiting high positivity or negativity. This distribution suggests that most articles have a balanced tone, with only a few articles at sentiment extremes.
# Categorize articles by sentiment
articles_df['Sentiment Category'] = articles_df['Sentiment Score'].apply(
    lambda x: 'Positive' if x > 0 else ('Negative' if x < 0 else 'Neutral')
)
# Count the number of articles in each sentiment category
category_counts = articles_df['Sentiment Category'].value_counts()
print("\nNumber of Articles by Sentiment Category:")
print(category_counts)
Number of Articles by Sentiment Category:
Sentiment Category
Positive    404
Neutral      53
Negative     43
Name: count, dtype: int64
Positive Articles: The majority of articles (404) have a positive sentiment.
Neutral Articles: A significant portion (53) is neutral, showing balanced or neutral news coverage.
Negative Articles: A smaller number of articles (43) are negative, indicating that negative sentiment is less common.
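These categories come from a simple thresholding of the TextBlob polarity score (positive above 0, negative below 0, neutral at exactly 0); the rule in isolation, applied to a few toy scores:

```python
# Label toy polarity scores with the same thresholds used above
def categorize(score):
    return 'Positive' if score > 0 else ('Negative' if score < 0 else 'Neutral')

scores = [0.73, 0.0, -0.16, 0.08, -0.5]
labels = [categorize(s) for s in scores]
print(labels)  # → ['Positive', 'Neutral', 'Negative', 'Positive', 'Negative']
```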
# Group data by Ticker and Sentiment Category
sentiment_counts = articles_df.groupby(['Ticker', 'Sentiment Category']).size().unstack(fill_value=0).sort_values(by=['Negative', 'Neutral'], ascending=False)
# Plot a stacked bar chart
sentiment_counts.plot(kind='bar', stacked=True, figsize=(12, 8), color=['darkred', 'yellow', 'green'])
plt.title('Sentiment Distribution by Ticker')
plt.xlabel('Ticker')
plt.ylabel('Count of Articles')
plt.legend(title='Sentiment Category')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
This stacked bar chart illustrates the distribution of sentiment across different stock tickers. Notably, JPM, BAC, MA, and TSLA have a more significant proportion of negative (red) and neutral (yellow) sentiments, while many other stocks are predominantly positive (green). This trend may indicate that JPM, BAC, and TSLA are currently experiencing more mixed or negative media coverage, whereas stocks like GOOGL, ADBE, and META are almost entirely positive, suggesting a generally favorable sentiment around these tickers. This insight can be useful for identifying stocks that might face sentiment-driven volatility or positive momentum.
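The per-ticker sentiment mix the chart shows can also be summarized numerically as each ticker's share of negative articles; a sketch on hypothetical counts (not the real sentiment_counts values):

```python
# Toy per-ticker sentiment counts (hypothetical numbers, standing in for sentiment_counts)
counts = {
    'JPM':   {'Negative': 7, 'Neutral': 5, 'Positive': 13},
    'GOOGL': {'Negative': 0, 'Neutral': 1, 'Positive': 24},
}

# Share of negative coverage per ticker -- the quantity the stacked bars make visible
neg_share = {t: c['Negative'] / sum(c.values()) for t, c in counts.items()}
print(neg_share)  # → {'JPM': 0.28, 'GOOGL': 0.0}
```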
# Top 5 positive articles
top_positive = articles_df.head(5)
print("\nTop 5 Positive Articles:")
top_positive[['Ticker', 'Date', 'Article Content', 'Sentiment Score', 'Article Link']]
Top 5 Positive Articles:
| | Ticker | Date | Article Content | Sentiment Score | Article Link |
|---|---|---|---|---|---|
| 0 | META | 2024-11-02 07:45:12 | Magnificent Meta and Microsoft | 1.000000 | https://www.youtube.com/watch%3Fv%3DTB_dDBD-DGQ |
| 1 | MSFT | 2024-11-02 07:44:14 | Microsoft: Most Magnificent, Fairly Valued (NA... | 0.733333 | https://seekingalpha.com/article/4731992-micro... |
| 2 | V | 2024-11-04 07:47:15 | US election showdown: Latest polls show Harris... | 0.500000 | https://www.business-standard.com/world-news/u... |
| 3 | NVDA | 2024-11-04 07:43:50 | Nvidia, Apple, Sherwin-Williams, DJT, Talen En... | 0.500000 | https://www.barrons.com/articles/stock-market-... |
| 4 | BAC | 2024-11-03 07:49:47 | Buffett's Berkshire Hathaway cuts Apple, BofA ... | 0.500000 | https://www.msn.com/en-us/money/topstocks/buff... |
# Top 5 negative articles
top_negative = articles_df.tail(5)
print("\nTop 5 Negative Articles:")
top_negative[['Ticker', 'Date', 'Article Content', 'Sentiment Score', 'Article Link']]
Top 5 Negative Articles:
| | Ticker | Date | Article Content | Sentiment Score | Article Link |
|---|---|---|---|---|---|
| 495 | META | 2024-11-02 07:45:10 | Meta: AI Train Isn't Slowing Down Anytime Soon | -0.155556 | https://seekingalpha.com/article/4732314-meta-... |
| 496 | AMZN | 2024-11-04 07:37:40 | Talen Stock Tumbles on Amazon Nuclear Power Se... | -0.155556 | https://www.barrons.com/articles/talen-stock-p... |
| 497 | PG | 2024-11-03 07:47:48 | NEET PG Counselling 2024 Schedule: The Medical... | -0.163636 | https://english.jagran.com/education/neet-pg-c... |
| 498 | DIS | 2024-11-04 01:50:16 | If you're seeing this message, that means Java... | -0.200000 | https://www.techinasia.com/news/disney-forms-b... |
| 499 | PYPL | 2024-11-01 07:50:46 | PayPal: Just 3 Million Shy Of Making Another R... | -0.500000 | https://seekingalpha.com/article/4731471-paypa... |
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from imblearn.over_sampling import RandomOverSampler
articles_df.head(1)
| | Ticker | Relative Date | Date | Media | Journalist | Article Content | Article Link | Sentiment Score | Sentiment Category |
|---|---|---|---|---|---|---|---|---|---|
| 0 | META | 2 days ago | 2024-11-02 07:45:12 | YouTube | N/A | Magnificent Meta and Microsoft | https://www.youtube.com/watch%3Fv%3DTB_dDBD-DGQ | 1.0 | Positive |
# Convert article content into feature vectors
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(articles_df['Article Content'])
# Use the TextBlob-labeled sentiment as the target variable
y = articles_df['Sentiment Category']
# Split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
# Evaluate the classifier on the test set
y_pred = nb_classifier.predict(X_test)
print("Classification Report for Naive Bayes Classifier:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Classification Report for Naive Bayes Classifier:
precision recall f1-score support
Negative 0.23 0.25 0.24 12
Neutral 0.80 0.33 0.47 12
Positive 0.83 0.89 0.86 76
accuracy 0.75 100
macro avg 0.62 0.49 0.52 100
weighted avg 0.75 0.75 0.74 100
Accuracy Score: 0.75
# Adjust alpha (smoothing parameter)
nb_classifier = MultinomialNB(alpha=0.5)
nb_classifier.fit(X_train, y_train)
y_pred = nb_classifier.predict(X_test)
print("Classification Report for Naive Bayes Classifier with Alpha Adjustment:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Classification Report for Naive Bayes Classifier with Alpha Adjustment:
precision recall f1-score support
Negative 0.27 0.25 0.26 12
Neutral 0.60 0.50 0.55 12
Positive 0.85 0.88 0.86 76
accuracy 0.76 100
macro avg 0.57 0.54 0.56 100
weighted avg 0.75 0.76 0.75 100
Accuracy Score: 0.76
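Rather than trying a single alpha value by hand, the smoothing parameter can be selected systematically with cross-validation. The sketch below uses sklearn's `GridSearchCV` on a tiny made-up corpus (the texts, labels, and alpha grid are illustrative stand-ins, not the homework data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus standing in for the article headlines above
texts = ["stock surges on strong earnings", "shares tumble after weak outlook",
         "company reports record profit", "stock falls on lawsuit news",
         "analysts upbeat on growth", "investors worried about losses"] * 5
labels = ["Positive", "Negative"] * 15

X = CountVectorizer().fit_transform(texts)

# Search over several smoothing values instead of picking one manually
grid = GridSearchCV(MultinomialNB(), {'alpha': [0.1, 0.5, 1.0, 2.0]},
                    cv=3, scoring='accuracy')
grid.fit(X, labels)

print("Best alpha:", grid.best_params_['alpha'])
print("Best CV accuracy:", round(grid.best_score_, 3))
```

In the notebook, `X` and `labels` would be replaced by the `X_train`/`y_train` split from the cells above; the same pattern then picks the alpha to use in the final classifier.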
# Oversample the training set to balance classes
ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
# Train and evaluate with resampled data
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_resampled, y_train_resampled)
y_pred = nb_classifier.predict(X_test)
print("Classification Report with Oversampling:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Classification Report with Oversampling:
precision recall f1-score support
Negative 0.56 0.42 0.48 12
Neutral 0.53 0.67 0.59 12
Positive 0.91 0.91 0.91 76
accuracy 0.82 100
macro avg 0.67 0.66 0.66 100
weighted avg 0.82 0.82 0.82 100
Accuracy Score: 0.82
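Oversampling duplicates minority-class rows; an alternative that pursues the same goal without inflating the training set is to reweight samples during fitting, since `MultinomialNB.fit` accepts a `sample_weight` argument. A minimal sketch with synthetic count features standing in for the article vectors:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(42)

# Imbalanced synthetic counts: class 1 dominates, as "Positive" does above
X = rng.integers(0, 5, size=(100, 8))
y = np.array([1] * 80 + [0] * 20)

# 'balanced' gives each class equal total weight, similar in spirit to oversampling
weights = compute_sample_weight('balanced', y)

clf = MultinomialNB()
clf.fit(X, y, sample_weight=weights)

# Each minority-class sample now carries 4x the weight of a majority one
print(weights[0], weights[-1])  # 0.625 for class 1, 2.5 for class 0
```

This avoids the exact-duplicate rows that `RandomOverSampler` creates, though for Naive Bayes the two approaches often behave similarly.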
# Predict sentiment on the full dataset
articles_df['Naive Bayes Sentiment'] = nb_classifier.predict(vectorizer.transform(articles_df['Article Content']))
# Count the distribution of Naive Bayes sentiment predictions
print("\nNaive Bayes Sentiment Distribution:")
print(articles_df['Naive Bayes Sentiment'].value_counts())
Naive Bayes Sentiment Distribution:
Naive Bayes Sentiment
Positive    391
Neutral      62
Negative     47
Name: count, dtype: int64
# Compare TextBlob and Naive Bayes sentiment labels
comparison_df = articles_df[["Ticker", "Date", "Media", "Article Content", "Article Link", "Sentiment Score", "Sentiment Category", "Naive Bayes Sentiment"]].copy()
comparison_df['Agreement'] = comparison_df['Sentiment Category'] == comparison_df['Naive Bayes Sentiment']
agreement_percentage = comparison_df['Agreement'].mean() * 100
num_discrepancies = (~comparison_df['Agreement']).sum()
print(f"\nAgreement between TextBlob and Naive Bayes: {agreement_percentage:.2f}%")
print(f"Discrepancies: {num_discrepancies} out of {comparison_df.shape[0]}")
comparison_df[~comparison_df['Agreement']].head()
Agreement between TextBlob and Naive Bayes: 92.60%
Discrepancies: 37 out of 500
| Ticker | Date | Media | Article Content | Article Link | Sentiment Score | Sentiment Category | Naive Bayes Sentiment | Agreement | |
|---|---|---|---|---|---|---|---|---|---|
| 3 | NVDA | 2024-11-04 07:43:50 | Barron's | Nvidia, Apple, Sherwin-Williams, DJT, Talen En... | https://www.barrons.com/articles/stock-market-... | 0.5 | Positive | Neutral | False |
| 4 | BAC | 2024-11-03 07:49:47 | MSN | Buffett's Berkshire Hathaway cuts Apple, BofA ... | https://www.msn.com/en-us/money/topstocks/buff... | 0.5 | Positive | Negative | False |
| 5 | AAPL | 2024-11-04 04:43:26 | Seeking Alpha | Apple: No Reason For The Love (NASDAQ:AAPL) | https://seekingalpha.com/article/4732532-apple... | 0.5 | Positive | Negative | False |
| 11 | PG | 2024-11-04 07:47:45 | Republic World | Diljit Dosanjh Jaipur Concert: PG Students Sav... | https://www.republicworld.com/entertainment/ce... | 0.4 | Positive | Neutral | False |
| 13 | HD | 2024-11-04 04:48:38 | Offshore Energy | HD HHI kicks off autonomous ship demonstration... | https://www.offshore-energy.biz/hd-hhi-kicks-o... | 0.4 | Positive | Neutral | False |
# Calculate positive sentiment proportions for each stock based on Naive Bayes results
positive_sentiment_df = articles_df[articles_df['Naive Bayes Sentiment'] == 'Positive']
positive_proportions = positive_sentiment_df['Ticker'].value_counts() / articles_df['Ticker'].value_counts()
positive_proportions = positive_proportions.dropna().sort_values(ascending=False)
# Select top 10 stocks for portfolio
top_10_stocks = positive_proportions.head(10)
print("\nTop 10 Stocks for Portfolio based on Positive Sentiment Trends:")
print(top_10_stocks)
Top 10 Stocks for Portfolio based on Positive Sentiment Trends:
Ticker
GOOGL    0.96
PYPL     0.96
ADBE     0.96
NFLX     0.96
MSFT     0.92
META     0.88
AMZN     0.84
UNH      0.84
DIS      0.84
JNJ      0.84
Name: count, dtype: float64
The top 10 stocks for the portfolio, selected based on positive sentiment trends, show strong positive coverage, with Google (GOOGL), PayPal (PYPL), Adobe (ADBE), and Netflix (NFLX) leading at 96% positivity. These companies reflect strong sentiment, suggesting a good market perception and potential investor confidence. Microsoft (MSFT), Meta (META), and Amazon (AMZN) also rank high, with positivity scores above 80%. Such sentiment-driven analysis could signal stability and growth potential, making these stocks attractive candidates for the portfolio.
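The per-ticker proportions above are built by dividing two `value_counts` results; `pd.crosstab` with `normalize='index'` computes the full sentiment breakdown per ticker in one step. A small sketch with made-up tickers and labels (not the homework data):

```python
import pandas as pd

# Toy stand-in for articles_df with one sentiment label per article
df = pd.DataFrame({
    'Ticker': ['GOOGL', 'GOOGL', 'GOOGL', 'MSFT', 'MSFT', 'MSFT'],
    'Naive Bayes Sentiment': ['Positive', 'Positive', 'Negative',
                              'Positive', 'Neutral', 'Neutral'],
})

# normalize='index' makes each row sum to 1: every cell is the share of
# that ticker's articles carrying that sentiment
props = pd.crosstab(df['Ticker'], df['Naive Bayes Sentiment'], normalize='index')
print(props.round(2))

# Rank tickers by their positive share, as the portfolio step does
print(props['Positive'].sort_values(ascending=False))
```

Besides being shorter, this keeps the Neutral and Negative shares available, so a portfolio rule could also penalize tickers with a large negative share rather than looking at positivity alone.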
Upload the ratings.csv data set with movieId, userId, and the rating as columns and do the following.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 20)
# Load the config file
with open('config.json', 'r') as f:
config = json.load(f)
data_loc = config["data_loc"]
file_name = "ratings.csv"
ratings_df = pd.read_csv(data_loc + file_name)
rows, columns = ratings_df.shape
print(f"The dataset contains {rows:,} rows and {columns} columns")
ratings_df.head(5)
The dataset contains 100,836 rows and 3 columns
| userId | movieId | rating | |
|---|---|---|---|
| 0 | 1 | 1 | 4.0 |
| 1 | 1 | 3 | 4.0 |
| 2 | 1 | 6 | 4.0 |
| 3 | 1 | 47 | 5.0 |
| 4 | 1 | 50 | 5.0 |
# Identify movies with more than 100 ratings
pop_movies = ratings_df['movieId'].value_counts()
pop_movies = pop_movies[pop_movies > 100].index
# Keep only the ratings for those popular movies
ratings_filtered = ratings_df[ratings_df['movieId'].isin(pop_movies)]
ratings_filtered
| userId | movieId | rating | |
|---|---|---|---|
| 0 | 1 | 1 | 4.0 |
| 2 | 1 | 6 | 4.0 |
| 3 | 1 | 47 | 5.0 |
| 4 | 1 | 50 | 5.0 |
| 7 | 1 | 110 | 4.0 |
| ... | ... | ... | ... |
| 100217 | 610 | 48516 | 5.0 |
| 100310 | 610 | 58559 | 4.5 |
| 100326 | 610 | 60069 | 4.5 |
| 100380 | 610 | 68954 | 3.5 |
| 100452 | 610 | 79132 | 4.0 |
19788 rows × 3 columns
# Create a utility matrix with users as rows and movies as columns
utility_matrix = ratings_filtered.pivot(index='userId', columns='movieId', values='rating')
utility_matrix[0:5]
| movieId | 1 | 2 | 6 | 10 | 32 | 34 | 39 | 47 | 50 | 110 | ... | 7153 | 7361 | 7438 | 8961 | 33794 | 48516 | 58559 | 60069 | 68954 | 79132 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| userId | |||||||||||||||||||||
| 1 | 4.0 | NaN | 4.0 | NaN | NaN | NaN | NaN | 5.0 | 5.0 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.5 | NaN | NaN | 4.0 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | 2.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 4.0 | NaN | NaN | NaN | NaN | 4.0 | 3.0 | NaN | 4.0 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 134 columns
# The pivot already returns a DataFrame, so we can count the NaNs directly
num_missing_ratings = utility_matrix.isna().sum().sum()
num_missing_ratings
60210
# Calculate the total number of values in the matrix
total_values = utility_matrix.size
total_values
79998
# Calculate the percentage of missing values
percent_missing = round((num_missing_ratings / total_values) * 100, 2)
# Display the results
print("Number of missing ratings:", num_missing_ratings)
print(f"Percentage of missing values: {percent_missing}%")
Number of missing ratings: 60210
Percentage of missing values: 75.26%
The utility matrix shows significant sparsity in the dataset, with 60,210 missing ratings, representing 75.26% of the data.
This high percentage of missing values is typical in real-world recommendation systems, where users interact with only a small fraction of available items. Such sparsity is a challenge but also highlights the value of collaborative filtering methods like SVD, which can predict missing ratings by uncovering latent patterns in user behavior and item characteristics.
The utility matrix’s structure allows us to systematically handle and analyze these missing values, making it a crucial tool for effective recommendations despite incomplete data.
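The claim that SVD uncovers latent patterns can be illustrated on a toy matrix: two user groups with opposite tastes, a few entries hidden, and a low-rank reconstruction of the mean-filled matrix. This is a minimal numpy sketch of the idea only, not the algorithm used below (surprise's SVD fits biases and factors by SGD over observed ratings rather than factoring a filled matrix):

```python
import numpy as np

# Toy ratings: two user groups with opposite tastes over 4 movies
true = np.array([[5, 5, 1, 1],
                 [5, 4, 1, 2],
                 [1, 1, 5, 5],
                 [2, 1, 4, 5]], dtype=float)

# Hide two entries to mimic the sparsity of the utility matrix
R = true.copy()
R[0, 3] = np.nan
R[2, 0] = np.nan

# Fill the gaps with the overall mean, then keep only the top two factors
filled = np.where(np.isnan(R), np.nanmean(R), R)
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
rank2 = (U[:, :2] * s[:2]) @ Vt[:2, :]

print("Rank-2 estimates for the hidden cells:",
      rank2[0, 3].round(2), rank2[2, 0].round(2), "(true values were 1 and 1)")
print("Share of energy in top 2 factors:", round(s[:2].sum() / s.sum(), 3))
```

Because the taste structure is essentially rank 2, the two leading factors carry most of the signal, and the reconstruction pulls the hidden cells away from the bland mean fill toward the group pattern.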
from surprise import Dataset, Reader
from surprise import SVD
from surprise.model_selection import train_test_split, cross_validate
from surprise import accuracy
from surprise.accuracy import rmse
# Define a Reader with the appropriate rating scale
reader = Reader(rating_scale=(ratings_df['rating'].min(), ratings_df['rating'].max()))
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)
# Train-Test Split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
# Initialize and Train the SVD Model
svd = SVD(n_epochs=20, lr_all=0.005, reg_all=0.2)
svd.fit(trainset)
# Evaluate the Model (verbose=False avoids surprise's own duplicate printout)
predictions = svd.test(testset)
rmse_score = rmse(predictions, verbose=False)
print("RMSE:", rmse_score)
RMSE: 0.8805328477373325
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 0.8689 0.8703 0.8709 0.8856 0.8747 0.8741 0.0061
MAE (testset) 0.6707 0.6711 0.6713 0.6794 0.6742 0.6733 0.0033
Fit time 0.91 1.02 0.87 1.07 0.88 0.95 0.08
Test time 0.31 0.20 2.42 0.10 0.10 0.62 0.90
{'test_rmse': array([0.86887032, 0.870294 , 0.87085634, 0.88561234, 0.87468257]),
'test_mae': array([0.67072436, 0.67109641, 0.67125213, 0.67938632, 0.67419942]),
'fit_time': (0.9120039939880371,
1.0182032585144043,
0.8686258792877197,
1.0672190189361572,
0.8846390247344971),
'test_time': (0.306441068649292,
0.20035314559936523,
2.415268898010254,
0.0992732048034668,
0.09607696533203125)}
The evaluation of the SVD algorithm across 5 folds shows consistent performance, with an average RMSE of 0.8741 and an MAE of 0.6733.
These metrics indicate that the model’s predictions are reasonably accurate, though some variability exists (as seen in the standard deviations). RMSE, being slightly higher, suggests the presence of occasional larger errors, whereas the lower MAE indicates generally stable performance with smaller average errors.
These results suggest that SVD provides a robust foundation for accurate recommendations, though further tuning may reduce occasional outliers in predictions.
# Select a user who has rated movies
user_id = ratings_df['userId'].sample(1).iloc[0] # Randomly selecting a userId for demonstration
# Get all movies and filter for those the user has not rated
all_movie_ids = ratings_df['movieId'].unique()
rated_movie_ids = ratings_df[ratings_df['userId'] == user_id]['movieId'].values
unrated_movie_ids = [movie_id for movie_id in all_movie_ids if movie_id not in rated_movie_ids]
# Predict ratings for movies the user hasn't rated
predictions = [svd.predict(user_id, movie_id) for movie_id in unrated_movie_ids]
# Sort predictions by estimated rating in descending order and select top 5
top_5_recommendations = sorted(predictions, key=lambda x: x.est, reverse=True)[:5]
print(f"Top 5 movie recommendations for user {user_id}:")
for pred in top_5_recommendations:
print(f"Movie ID: {pred.iid}, Predicted Rating: {pred.est:.2f}")
Top 5 movie recommendations for user 414:
Movie ID: 898, Predicted Rating: 4.06
Movie ID: 1283, Predicted Rating: 4.04
Movie ID: 3435, Predicted Rating: 4.04
Movie ID: 2160, Predicted Rating: 4.02
Movie ID: 306, Predicted Rating: 4.00
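The sort-and-slice step above can be wrapped in a small reusable helper. The sketch below is library-agnostic: it only assumes each prediction exposes an item id and an estimated rating, as surprise's `Prediction` objects do via `.iid` and `.est` (a lightweight namedtuple stands in for them here):

```python
from collections import namedtuple

# Stand-in for surprise's Prediction; only .iid and .est are used
Prediction = namedtuple('Prediction', ['iid', 'est'])

def top_n(predictions, n=5):
    """Return the n predictions with the highest estimated rating."""
    return sorted(predictions, key=lambda p: p.est, reverse=True)[:n]

preds = [Prediction(898, 4.06), Prediction(306, 4.00),
         Prediction(1283, 4.04), Prediction(2160, 4.02),
         Prediction(3435, 4.04), Prediction(50, 3.10)]

for p in top_n(preds, n=5):
    print(f"Movie ID: {p.iid}, Predicted Rating: {p.est:.2f}")
```

In the notebook, the list built by `svd.predict(...)` can be passed to `top_n` directly, which makes it easy to generate recommendations for several users in a loop.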
For this homework, I primarily relied on ChatGPT as my AI tool.
It helped me understand complex concepts by breaking them into smaller, manageable parts that I could piece together, and its detailed, easy-to-follow explanations significantly enhanced my learning.
ChatGPT also improved my writing by helping me articulate ideas more clearly and concisely. In one instance, I spent about 25 minutes trying to formulate an explanation; after refining it with ChatGPT, the core meaning of my original response stayed intact while the clarity and brevity improved considerably. Whenever I use it, I also ask for tips to improve my writing.
Another area where ChatGPT helped was data visualization. Previously, I would spend considerable time searching platforms like Stack Overflow for the right plotting code. With ChatGPT, I could get the necessary code quickly and visualize the data more efficiently.